session info
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] tictoc_1.0.1 doParallel_1.0.16 iterators_1.0.13 foreach_1.5.1
##
## loaded via a namespace (and not attached):
## [1] codetools_0.2-18 digest_0.6.27 R6_2.5.1 jsonlite_1.7.2
## [5] magrittr_2.0.1 evaluate_0.14 rlang_0.4.11 stringi_1.7.4
## [9] jquerylib_0.1.4 bslib_0.3.0 rmarkdown_2.11 tools_4.0.2
## [13] stringr_1.4.0 xfun_0.26 yaml_2.2.1 fastmap_1.1.0
## [17] compiler_4.0.2 htmltools_0.5.2 knitr_1.34 sass_0.4.0
We download two datasets:
We need to do some cleaning so that the gene names across the two datasets can be matched. This may not be optimal or fully correct but it seems to give a reasonable number of matches.
## Rows: 7091 Columns: 48
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): Oligo_ID, NAME
## dbl (46): TP 1, TP 2, TP 3, TP 4, TP 5, TP 6, TP 7, TP 8, TP 9, TP 10, TP 11...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 4838 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): gene, locus_tag, transcript_ID
## dbl (24): kali320, kali310, kali355, kali351, kali357, kali244, kali326, kal...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Lemieux et al (PNAS, 2008) proposed a simple approach for estimating the hours post invasion (hpi) for an in vivo sample. The reference data \(z\) from Bozdech et al can be thought of a cyclical process \(z_g\) for each gene \(g\). They proposed first to estimate \(z_g\) using smoothing splines. Then the “distance” of the data \(x\) and the reference genome at each timepoint \(t\in [0,48]\) can be calculated as the sum of squared residuals \(z_g(t) - x_g\) (i.e. Gaussian likelihood for the residuals). The MLE \(t^*\) is then the value of \(t\) that minimises these residuals.
A key issue is that the reference data \(z\) and the testing data \(x\) are not in the same units. The reference data are microarray data (intensity values on a microarray); the data from Andrade et al are RNA seq data: Illumina read counts (going from 0 to around 100,000).
We use the original code written by Avi Feller (R script Avi_Feller_functions.R).
## There are 3262 gene names that match across Bozdech reference transcriptome and the Andrade et al RNA seq dataset
## The difference between the median values in each group is 4.45 hours
A population of parasites within a host are never in complete synchrony (i.e. not all at exactly the same developmental stage). In symptomatic malaria we do usually see quite synchronous infections, for example the work by Fairley (1947) showed that parasitaemia over time followed an increasing sine-wave pattern. This is driven by concurrent schizogony followed by sequestration. Synchronicity in symptomatic malaria is partly driven by fever (see Kwiatkowski. Febrile temperatures can synchronize the growth of Plasmodium falciparum in vitro. J Exp Med. 1989).